The debates that occur in the Canadian Parliament are transcribed and published in a document known as Hansard. Since 2006, Hansard has been available for public download in .xml file format. To date, this archive contains over 57 million words. By converting these transcripts from .xml to .txt, the files can be processed in numerous ways with the Python programming language.
The purpose of this document is to illustrate the counting of specific words in the collection of text files that will be referred to from now on as the corpus. Terms that may be unfamiliar to the reader are highlighted in bold, with a definition that can be accessed by hovering the mouse pointer over the word. While it is not necessary to read every line of code to understand the process, explanatory comments within the code are marked with a # and coloured light blue.
Before we can begin working with a piece of text, it must be loaded into and read by Python. Rather than altering (and potentially irreversibly changing) our original text file, we will work with the contents of the file as a string of text contained within a variable. Once the contents have been read into the variable, we can close the original file, keeping it intact.
In the next piece of code we will open the file that contains the textual transcripts of all of the Parliamentary debates that occurred in the House of Commons during the 39th Parliament.
While working with one long string may be useful for other applications, our purposes require that we split the text into pieces. In this case, each word will become its own unique string.
In [1]:
# 1. open the text file
infile = open('data/39.txt')
# 2. read the file and assign it to the variable 'text'
text = infile.read()
# 3. close the text file
infile.close()
# 4. split the variable 'text' into distinct word strings
words = text.split()
Now that we have loaded our file, we can begin to work on it. Python offers us a lot of pre-built tools to make the task of coding easier. Some of the most commonly used tools are known as functions. Functions are useful for automating tasks that would otherwise require a repetitive amount of coding. While Python has many built-in functions, the language's true power comes from the ability to define unique functions based on programming needs. In the code above, we've already used four of these built-in tools: open, read, close, and split.
In the next piece of code we will define our own function, called count_in_list. This function will allow us to count the occurrences of any word in the corpus.
In [2]:
# 5. define the 'count_in_list' function
def count_in_list(item_to_count, list_to_search):
    "Counts the number of occurrences of a specified word within a list of words"
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits
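Before turning to the corpus, we can sanity-check the new function on a small made-up list (the words here are purely illustrative):
In [ ]:
# a toy example: 'apple' appears twice in the list below,
# so count_in_list should return 2
print count_in_list("apple", ["apple", "banana", "apple", "cherry"])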
Now we can call the function for any word we choose. The next example shows that there are 392 occurrences of the word privacy contained in the transcripts for the 39th Parliament.
In [3]:
# 6. here the function counts the instances of the word 'privacy'
print "Instances of the word \'privacy\':", (count_in_list("privacy", words))
Unfortunately, there are two distinct problems here, centred around the fact that our function is only counting the string privacy exactly as it appears.
The first problem is that text strings are case-sensitive: a word will only be counted if its UPPERCASE and lowercase letters match the search string exactly. The following example counts the number of instances of Privacy with the first letter capitalized.
In [4]:
print "Instances of the word \'Privacy\':", (count_in_list("Privacy", words))
Here is a more extreme example to illustrate the point.
In [5]:
print "Instances of the word \'pRiVaCy\':",(count_in_list("pRiVaCy", words))
The second problem is that of punctuation. Just as words are case-sensitive, they are also punctuation-sensitive: if a piece of punctuation is attached to a string, it will be included in the search. Here we count the occurrences of privacy, shown here with a comma after the word.
In [6]:
print "Instances of the word \'privacy,\':", (count_in_list("privacy,", words))
And here we count privacy., with the word followed by a period.
In [7]:
print "Instances of the word \'privacy.\':",(count_in_list("privacy.", words))
We could comb through the text to find all of the different instantiations of privacy, then run the code for each one and add together all of the numbers, but that would be time-consuming and potentially inaccurate. Instead, we must process the text further to make it uniform. In this case we want to make all of the characters lowercase and remove all of the punctuation.
Python's built-in string methods will do this for us. We will reload the text file, split the text into distinct words or tokens, and then clean the resulting list of tokens.
In [8]:
infile = open('data/39.txt')
text = infile.read()
infile.close()
tokens = text.split()
# here we clean the text: keep only fully alphabetic tokens, lowercased
words = [w.lower() for w in tokens if w.isalpha()]
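It is worth pausing on what that cleaning step actually does: isalpha() keeps only tokens made up entirely of letters, so a token like privacy, (with the comma attached) is discarded outright rather than stripped of its punctuation. A minimal illustration, using made-up tokens:
In [ ]:
# made-up tokens to illustrate the cleaning step: fully
# alphabetic tokens are lowercased, while tokens containing
# punctuation (like 'privacy,') are dropped entirely
sample = ["Privacy", "pRiVaCy", "privacy,", "privacy"]
print [w.lower() for w in sample if w.isalpha()]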
Now, when we count the instances of privacy, we are presented with a total of 846 instances.
In [9]:
print "Instances of the word \'privacy\':", (count_in_list("privacy", words))
Now let's see how this compares to the rest of the corpus. To accomplish this, we must write another function that will read all of the text files in our file folder.
First we need to introduce a feature of Python that we've yet to see: modules. Modules are packages of functions and code that serve specific purposes. These work much like functions, but are more complex.
The next piece of code imports a module called os, specifically the function listdir. We will use listdir to print a list of all the files in a specific directory. Each of the listed files corresponds to a textual transcript of Hansard. The first nine files contain the complete transcript for each year from 2006 to 2014, while the last three files are the transcripts for each Parliament, in this case the 39th through to the end of the second session of the 41st.
In [10]:
# imports the os module
from os import listdir
# calls the listdir function to list the files in a specific directory
listdir("data")
Out[10]:
Although we can display the contents of a directory by using the listdir function, Python needs those names stored in a list in order to iterate over them. We also want to specify that only files with the extension .txt are included. Here we create another function called list_textfiles.
In [11]:
def list_textfiles(directory):
"Return a list of filenames ending in '.txt'"
textfiles = []
for filename in listdir(directory):
if filename.endswith(".txt"):
textfiles.append(directory + "/" + filename)
return textfiles
Rather than writing code to open each file individually, we can create another custom function to open the file we pass to it. We'll call this one read_file.
In [12]:
def read_file(filename):
"Read the contents of FILENAME and return as a string."
infile = open(filename)
contents = infile.read()
infile.close()
return contents
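As a quick usage example (assuming the same data/39.txt file we opened by hand earlier), the helper reproduces our original open/read/close sequence in a single call:
In [ ]:
# read one transcript with the new helper function;
# this is equivalent to the open/read/close steps above
text = read_file('data/39.txt')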
Now we can open all of the files in our directory, strip each file of uppercase letters and punctuation, split the whole of each text into tokens, and store all the data as separate lists in our variable corpus.
In [13]:
corpus = []
for filename in list_textfiles("data"):
    # reads the file into a string
    text = read_file(filename)
    # splits the text into tokens
    tokens = text.split()
    # keeps only fully alphabetic tokens, lowercased
    words = [w.lower() for w in tokens if w.isalpha()]
    # adds this file's word list to the corpus
    corpus.append(words)
Let's check to make sure the code worked by using the len function to count the number of items in our corpus list.
In [14]:
print"There are", len(corpus), "files in the list, named: ", ', '.join(list_textfiles('data'))
Let's create a function to make the names of the files more readable. First we'll have to strip the file extension .txt.
In [15]:
from os.path import splitext
def remove_ext(filename):
"Removes the file extension, such as .txt"
name, extension = splitext(filename)
return name
for files in list_textfiles('data'):
    print remove_ext(files)
Now let's make a function to remove the directory prefix data/.
In [16]:
from os.path import basename
def remove_dir(filepath):
"Removes the path from the file name"
name = basename(filepath)
return name
for files in list_textfiles('data'):
    print remove_dir(files)
And finally, we'll write a function to tie the two functions together.
In [17]:
def get_filename(filepath):
"Removes the path and file extension from the file name"
filename = remove_ext(filepath)
name = remove_dir(filename)
return name
filenames = []
for files in list_textfiles('data'):
files = get_filename(files)
filenames.append(files)
Now we can display a readable list of the files within our directory.
In [18]:
print"There are", len(corpus), "files in the list, named:", ', '.join(filenames),"."
The next step involves iterating through both lists, corpus and filenames, in order to generate a word frequency for each file in the corpus. For this we will use Python's zip function.
In [19]:
for words, names in zip(corpus, filenames):
    print "Instances of the word \'privacy\' in", names, ":", count_in_list("privacy", words)
What's exciting about this code is that we can now search the entire corpus for any word we choose. Let's search for information.
In [20]:
for words, names in zip(corpus, filenames):
    print "Instances of the word \'information\' in", names, ":", count_in_list("information", words)
How about ethics?
In [21]:
for words, names in zip(corpus, filenames):
    print "Instances of the word \'ethics\' in", names, ":", count_in_list("ethics", words)
While word frequencies, by themselves, do not give us a tremendous amount of contextual information, they are a valuable first step in conducting large-scale text analyses. For instance, returning to our frequency list for privacy, we can observe a general trend suggesting that the use of privacy has been increasing since 2006. It is important to note that our calculations are raw numbers. For a more contextual analysis we could account for how often Parliament was in session during each period, or perhaps compare the count of privacy to the total number of words in each file.
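As a preview of that second idea, here is a minimal sketch: dividing the raw count by the total number of words in each file yields a relative frequency that can be compared across files of different lengths.
In [ ]:
# a sketch of a relative frequency: the count of 'privacy'
# divided by the total number of words in each file
for words, names in zip(corpus, filenames):
    relative = count_in_list("privacy", words) / float(len(words))
    print "Relative frequency of \'privacy\' in", names, ":", relative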
Stay tuned for the next section: Adding Context to Word Frequency Counts